This document explores Tesla’s stock prices from its initial public offering on June 29, 2010, to December 31, 2019. Tesla’s share price has risen dramatically over several years, and machine learning algorithms may reveal seasonality or autocorrelated variables in its history. Predictive analysis, and machine learning in particular, can help shield investors from the uncertainty of the market. This investigation applies machine learning techniques to the historical prices to evaluate the direction of market movement and to forecast the future value of the stock. The research questions are as follows:

  1. How accurate is the ARIMA model with the addition of technical indicators from historical data?
  2. Which attribute(s) best predicts the direction of the stock market movement?

The future predictions will be compared to present-day stock values to determine the accuracy of the algorithms, along with a review of the factors that have affected the stock price to date (e.g. supply and demand, the economy, stock splits).

Importing Libraries and Data

library(quantmod) 
library(dplyr)
library(lubridate)
library(summarytools)
library(dvmisc)
library(corrplot)
library(tseries)
library(forecast)
library(ggplot2)
library(plotly)
library(caret)
library(formattable)
library(dygraphs)
library(hrbrthemes)
library(TSstudio)
library(FSelector)

stock_list <- "TSLA"
start_date <- as.Date("2010-06-29")
end_date <- as.Date("2020-01-01")
tesla <- NULL

# Loop over each ticker (one here, but the loop generalizes to several)
for (symbol in stock_list){
  getSymbols(symbol, verbose = FALSE, src = "yahoo", 
             from = start_date, to = end_date)
  temp_df <- as.data.frame(get(symbol))
  temp_df$Date <- row.names(temp_df)
  row.names(temp_df) <- NULL
  colnames(temp_df) <- c("Open", "High", "Low", "Close", 
                         "Volume", "Adjusted", "Date")
  temp_df <- temp_df[c("Date", "Open", "High", 
                       "Low", "Close", "Volume", "Adjusted")]
  tesla <- temp_df
}

In consideration of the COVID-19 pandemic, the Tesla dataset was restricted to the years 2010-2019. Due to the uncertainty surrounding the economy in 2020, I believe considerable noise would be present in the 2020 data.

The following visualization provides insight into key historical events that have impacted the price of Tesla’s stock.


Data Preparation and Exploration

The data analysis stage first consists of cleaning and inspecting the data for inconsistencies. Following these steps, the data may undergo transformations and modelling as required. As part of the data preparation stage, the following steps will be taken:

head(tesla)
##         Date  Open  High   Low Close   Volume Adjusted
## 1 2010-06-29 3.800 5.000 3.508 4.778 93831500    4.778
## 2 2010-06-30 5.158 6.084 4.660 4.766 85935500    4.766
## 3 2010-07-01 5.000 5.184 4.054 4.392 41094000    4.392
## 4 2010-07-02 4.600 4.620 3.742 3.840 25699000    3.840
## 5 2010-07-06 4.000 4.000 3.166 3.222 34334500    3.222
## 6 2010-07-07 3.280 3.326 2.996 3.160 34608500    3.160
str(tesla)
## 'data.frame':    2394 obs. of  7 variables:
##  $ Date    : chr  "2010-06-29" "2010-06-30" "2010-07-01" "2010-07-02" ...
##  $ Open    : num  3.8 5.16 5 4.6 4 ...
##  $ High    : num  5 6.08 5.18 4.62 4 ...
##  $ Low     : num  3.51 4.66 4.05 3.74 3.17 ...
##  $ Close   : num  4.78 4.77 4.39 3.84 3.22 ...
##  $ Volume  : num  93831500 85935500 41094000 25699000 34334500 ...
##  $ Adjusted: num  4.78 4.77 4.39 3.84 3.22 ...
tesla$Date <- as.Date(tesla$Date) 
class(tesla$Date)
## [1] "Date"

The ‘Date’ attribute was changed to represent a date type variable.

sum(is.na(tesla))
## [1] 0

There are no missing values.

Next, a correlation plot will determine whether the attributes are correlated and to what degree.

##               Open      High       Low     Close    Volume  Adjusted
## Open     1.0000000 0.9995588 0.9995524 0.9990113 0.4618287 0.9990113
## High     0.9995588 1.0000000 0.9994779 0.9996208 0.4710375 0.9996208
## Low      0.9995524 0.9994779 1.0000000 0.9995621 0.4526298 0.9995621
## Close    0.9990113 0.9996208 0.9995621 1.0000000 0.4624999 1.0000000
## Volume   0.4618287 0.4710375 0.4526298 0.4624999 1.0000000 0.4624999
## Adjusted 0.9990113 0.9996208 0.9995621 1.0000000 0.4624999 1.0000000

To understand the attributes further, the measures of central tendency will be reviewed.


Descriptive Statistics of Tesla Stock

Min Q1 Median Mean Q3 Max Std.Dev
Open 3.23 6.85 42.42 36.62 52.80 87.00 22.89
High 3.33 6.96 43.19 37.26 53.55 87.06 23.24
Low 3.00 6.70 41.61 35.96 52.02 85.27 22.52
Close 3.16 6.87 42.32 36.63 52.78 86.19 22.90
Volume (M) 0.59 9.37 22.65 27.17 36.26 185.82 23.60
Adjusted 3.16 6.87 42.32 36.63 52.78 86.19 22.90


These values show the range of values present in the Tesla stock over time. Notably, the gap between the minimum and maximum stock values is quite large, likely reflecting trends over several years. Since this is a time-series dataset starting from the initial public offering, it is unlikely that these extremes are outliers; the value of the stock has simply changed drastically over the years.

For the purposes of the forecasting investigation, the closing stock price (i.e. the value of the stock at the end of the trading day) will be used. The table below displays the trends of the closing stock prices from 2010-2019.


Closing Tesla Stock Price Statistics

Year Min Max Average % Change per Fiscal Year
2010 3.16 7.09 4.67 11.47
2011 4.37 6.99 5.36 7.29
2012 4.56 7.60 6.23 20.62
2013 6.58 38.67 20.88 325.42
2014 27.87 57.21 44.67 48.17
2015 37.00 56.45 46.01 9.44
2016 28.73 53.08 41.95 -4.35
2017 43.40 77.00 62.86 43.49
2018 50.11 75.91 63.46 3.83
2019 35.79 86.19 54.71 34.89


From this point, the data will be split into training and test sets. There will be two training and test sets: one for the ARIMA model and one for KNN. KNN requires a response variable to be calculated prior to splitting the data, in order to track the trends in the market. For the purpose of reviewing stock market trends, the stock’s daily return will be categorized into one of five classes, representing the quintiles of the daily returns.
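The quintile labelling applied later with quant_groups (from the dvmisc package) can be sketched in base R with quantile and cut; the returns below are hypothetical values for illustration, not drawn from the dataset:

```r
# Hypothetical daily returns (illustration only, not Tesla data)
returns <- c(-0.050, -0.020, -0.010, 0.000, 0.012,
              0.021,  0.040, -0.031, 0.034, 0.005)

# Break points at the 0th, 20th, ..., 100th percentiles split the
# returns into five equally populated classes (quintiles)
breaks <- quantile(returns, probs = seq(0, 1, by = 0.2))
response <- cut(returns, breaks = breaks, include.lowest = TRUE)

table(response)  # two observations in each of the five classes
```

Each class boundary is an empirical percentile of the returns, so the classes are balanced by construction, which is exactly the property quant_groups reports below.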

Further analysis will be conducted on the training set.

# Setting training and test sets for ARIMA
train.set <- tesla %>%
  filter(Date >= as.Date("2010/06/29") & Date <= as.Date("2016/12/31"))

test.set <- tesla %>%
  filter(Date >= as.Date("2017/01/01") & Date <= as.Date("2019/12/31"))


#KNN
# Generating the daily return
for (i in 2:nrow(tesla)){
  tesla$Return[i] <- 
    (tesla$Close[i] - tesla$Close[i-1])/tesla$Close[i-1] 
}

# Assigning a response variable
normalize <- function(x){
  return ((x - min(x)) / (max(x) - min(x)))
}

shift <- function(x, n){
  c(x[-(seq(n))], rep(NA, n))
}
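As a quick sanity check on the helper above, shift(x, 1) drops the first element and pads the tail with NA, so that each row can be paired with the following day’s value:

```r
# Same helper as above: advance a vector by n positions, padding with NA
shift <- function(x, n){
  c(x[-(seq(n))], rep(NA, n))
}

shift(c(0.010, -0.020, 0.030), 1)
## [1] -0.02  0.03    NA
```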

tesla$ReturnNext <- shift(tesla$Return, 1)
tesla <- na.omit(tesla)
tesla$Response <- quant_groups(tesla$Return, groups = 5)
## Observations per group: 479, 478, 478, 478, 479. 0 missing.
summary(tesla$Response)
##   [-0.193,-0.0192] (-0.0192,-0.00434] (-0.00434,0.00659] 
##                479                478                478 
##   (0.00659,0.0221]     (0.0221,0.244] 
##                478                479
# Setting training and test sets for KNN
train.set.knn <- tesla %>%
  filter(Date >= as.Date("2010/06/30") & Date <= as.Date("2016/12/31"))

test.set.knn <- tesla %>%
  filter(Date >= as.Date("2017/01/01") & Date <= as.Date("2019/12/31"))


Next, visualizations will be used to observe trends in the data. To ensure that the most accurate forecasts are obtained from the analysis, there are several aspects to consider when working with a time-series dataset. The following data exploration will determine whether the series is stationary, whether autocorrelation is present, and whether seasonality exists.

The first two visualizations will display the closing price of Tesla stock per day.



Histogram of Closing Price

Now that the closing stock prices have been visualized, it is clear that the data is not normally distributed. Further testing for stationarity is shown below.


Testing for Stationarity:

The Augmented Dickey-Fuller Test is a statistical test that determines whether the data is stationary. A p-value of less than 0.05 indicates stationary data, i.e. a series with a constant mean and constant variance over time.
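As an illustration of how the test behaves, white noise (stationary by construction) produces a small p-value, while a random walk, which drifts like a raw price series, produces a large one. This sketch uses simulated series, not the Tesla data:

```r
library(tseries)
set.seed(42)

white_noise <- rnorm(500)          # constant mean and variance
random_walk <- cumsum(rnorm(500))  # variance grows over time

# p-value below 0.05: reject the unit-root null, series is stationary
# (adf.test warns when the p-value is smaller than its printed minimum)
suppressWarnings(adf.test(white_noise)$p.value)

# large p-value: cannot reject non-stationarity
adf.test(random_walk)$p.value
```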

adf.test(train.set$Close)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  train.set$Close
## Dickey-Fuller = -2.0171, Lag order = 11, p-value = 0.5711
## alternative hypothesis: stationary

Based on the p-value of 0.5711, the Tesla time series is non-stationary. Differencing will be used to correct this. The built-in ndiffs function calculates the number of differences required to make the series stationary, and the differencing itself can be completed with the diff function. The visualizations below will be used to determine which transformation is best suited to the data, and the differenced data will then undergo the Augmented Dickey-Fuller Test again.

ndiffs(train.set$Close, test = "adf")
## [1] 1

Based on the plots above, the log-differenced data appears to be the best-performing transformation for stabilizing the variance of the time series.

adf.test(diff(log(train.set$Close)))
## Warning in adf.test(diff(log(train.set$Close))): p-value smaller than
## printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(log(train.set$Close))
## Dickey-Fuller = -11.213, Lag order = 11, p-value = 0.01
## alternative hypothesis: stationary

After differencing by one lag and applying a logarithmic transformation, the time series is now a stationary process.


Autocorrelation:

It is important to determine whether autocorrelation is present within the data. Autocorrelation refers to the degree of linear similarity between the data and a lagged version of itself; in other words, this assessment determines whether past observations influence the current ones. The following tests will compare the original and stationary data for the presence of autocorrelation.
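The behaviour to look for can be demonstrated on a simulated random walk: before differencing, the lag-1 autocorrelation is near 1 and decays slowly; after differencing it collapses toward zero. This is a sketch on simulated data, not the Tesla series:

```r
set.seed(7)
random_walk <- cumsum(rnorm(1000))

# acf()$acf stores lag 0 in position 1, so position 2 is the lag-1 value
acf_level <- acf(random_walk, lag.max = 20, plot = FALSE)
acf_diff  <- acf(diff(random_walk), lag.max = 20, plot = FALSE)

acf_level$acf[2]  # lag-1 autocorrelation, close to 1
acf_diff$acf[2]   # lag-1 autocorrelation, close to 0
```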

The above plot (right) shows that the autocorrelation falls to zero quickly, meaning that the data is now stationary. The autocorrelation stays within the dashed blue lines, which indicates that there is no longer a lag that is correlated with the data series. The partial autocorrelation plot below confirms this conclusion.

pacf(diff(log(train.set$Close)), main = " PACF Series: Log Differenced Price")

Since the log differenced closing price results in a stationary dataset, the values will be saved to a new variable to use in the modelling.

train_diff <- train.set %>% 
  select(Date, Open, High, Low, Close, Adjusted) %>% 
  mutate(Logdiff = c(NA, diff(log(Close)))) %>% 
  na.omit() 


Presence of Seasonality:

Multiplicative decomposition is common with economic time-series data. However, a log-differenced transformation was applied to the dataset. This transformation stabilized the variation in the series, making it possible to use an additive decomposition.
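A minimal sketch of additive decomposition on a simulated monthly series (the construction below is purely illustrative; the Tesla data itself is daily):

```r
set.seed(1)
# Simulated series: linear trend + repeating seasonal pattern + noise
trend_part    <- seq(10, 30, length.out = 120)
seasonal_part <- rep(sin(2 * pi * (1:12) / 12), times = 10)
x <- ts(trend_part + seasonal_part + rnorm(120, sd = 0.2), frequency = 12)

parts <- decompose(x, type = "additive")

# Because the random component is defined as the residual, the three
# parts sum back to x wherever the moving-average trend is defined
recomposed <- parts$trend + parts$seasonal + parts$random
```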

##  The tesla_ts series is a xts object with 1 variable and 1639 observations
##  Frequency: daily 
##  Start time: 2010-06-30 
##  End time: 2016-12-30

There does not appear to be a seasonal trend with this dataset.


Normalization:

train_norm <- normalize(train.set.knn[,c(2:5)])
train_norm$Return <- train.set.knn$Return
train_norm$Response <- train.set.knn$Response

Technical Indicators

The ARIMA model requires additional technical indicators to be calculated to strengthen the available information. These indicators include the following:

# The TTR indicator functions are vectorized over the whole series,
# so no row-by-row loop is required
#MACD
train_diff$MACD <- MACD(Cl(train_diff), nFast=12, nSlow=26, nSig=9)
#RSI
train_diff$RSI <- RSI(Cl(train_diff), n=14)
#Price Rate of Change
train_diff$ROC <- ROC(Cl(train_diff), n=14) * 100
#Simple Moving Average
train_diff$SMA <- SMA(Cl(train_diff), n=14)

hlac <- data.frame(x=Hi(train_diff), y=Lo(train_diff), z=Cl(train_diff))

#Stochastic Oscillator
train_diff$STO <- stoch(hlac, nFastK = 14) * 100
#William's %R
train_diff$WPR <- WPR(hlac, n=14) * (-100)

Modelling

ARIMA model:

Prior to determining how accurate the ARIMA model is with the addition of technical indicators, a base model will be built from the log-differenced closing prices. The base model will be fit under two information criteria, AIC and BIC, and the better of the two will be selected for further analysis. This model will provide a point of comparison for the second ARIMA model.
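For reference, AIC = -2 log L + 2k and BIC = -2 log L + k log(n), so BIC penalizes extra parameters more heavily whenever log(n) > 2 and tends to select smaller models. A brief sketch of reading both criteria off a fitted model, using a simulated AR(1) series rather than the Tesla data:

```r
library(forecast)
set.seed(123)

# Simulated AR(1) series with coefficient 0.6
y <- arima.sim(model = list(ar = 0.6), n = 500)

fit <- Arima(y, order = c(1, 0, 0))
c(AIC = fit$aic, BIC = fit$bic)  # with n = 500, log(n) > 2, so BIC > AIC
```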

Base Model

# AIC
set.seed(1)
arima_model <- auto.arima(train_diff$Logdiff, ic = "aic")
summary(arima_model)
## Series: train_diff$Logdiff 
## ARIMA(2,0,3) with non-zero mean 
## 
## Coefficients:
##          ar1      ar2      ma1     ma2     ma3    mean
##       1.6528  -0.7026  -1.6384  0.6348  0.0534  0.0014
## s.e.  0.1960   0.1778   0.1977  0.1836  0.0270  0.0008
## 
## sigma^2 estimated as 0.001091:  log likelihood=3266.96
## AIC=-6519.92   AICc=-6519.85   BIC=-6482.11
## 
## Training set error measures:
##                         ME       RMSE      MAE MPE MAPE      MASE
## Training set -5.927867e-05 0.03296829 0.022741 NaN  Inf 0.6918805
##                     ACF1
## Training set 0.004955865
checkresiduals(arima_model)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,3) with non-zero mean
## Q* = 6.2967, df = 4, p-value = 0.1781
## 
## Model df: 6.   Total lags used: 10
forecast1 <- forecast(arima_model, h = 30)
accuracy(forecast1, test=test.set$Close)
##                         ME       RMSE       MAE  MPE MAPE      MASE
## Training set -0.0007987901 0.02912411 0.0237286 -Inf  Inf 0.9925052
##                   ACF1
## Training set 0.2355389
# BIC
set.seed(2)
arima_modelb <- auto.arima(train_diff$Logdiff, ic = "bic")
summary(arima_modelb)
## Series: train_diff$Logdiff 
## ARIMA(3,0,1) with zero mean 
## 
## Coefficients:
##          ar1      ar2     ar3      ma1
##       0.5147  -0.0476  0.0081  -0.4923
## s.e.  3.0078   0.0744  0.1187   3.0095
## 
## sigma^2 estimated as 0.001093:  log likelihood=3264.49
## AIC=-6518.97   AICc=-6518.93   BIC=-6491.96
## 
## Training set error measures:
##                       ME       RMSE        MAE MPE MAPE      MASE
## Training set 0.001382708 0.03301828 0.02275106 NaN  Inf 0.6921865
##                      ACF1
## Training set -0.001743198
checkresiduals(arima_modelb)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,0,1) with zero mean
## Q* = 5.5268, df = 6, p-value = 0.4782
## 
## Model df: 4.   Total lags used: 10
forecast1b <- forecast(arima_modelb, h = 30)
accuracy(forecast1b, test = test.set$Close)
##                        ME       RMSE        MAE MPE MAPE      MASE
## Training set 0.0007141997 0.02921658 0.02380201 NaN  Inf 0.9955757
##                   ACF1
## Training set 0.2356334

The AIC method found a better fit for the model. The forecast of future stock prices is visualized below.

Revised Model

set.seed(3)
arima_model2 <- auto.arima(train_diff$Logdiff, ic = "aic", xreg = train_diff$ROC)
summary(arima_model2)
## Series: train_diff$Logdiff 
## Regression with ARIMA(0,0,0) errors 
## 
## Coefficients:
##        xreg
##       7e-04
## s.e.  1e-04
## 
## sigma^2 estimated as 0.0009602:  log likelihood=3333.31
## AIC=-6662.62   AICc=-6662.61   BIC=-6651.83
## 
## Training set error measures:
##                         ME       RMSE       MAE MPE MAPE      MASE
## Training set -8.118361e-05 0.03111061 0.0219327 NaN  Inf 0.6672884
##                      ACF1
## Training set -0.009015989
checkresiduals(arima_model2)

## 
##  Ljung-Box test
## 
## data:  Residuals from Regression with ARIMA(0,0,0) errors
## Q* = 28.191, df = 9, p-value = 0.0008862
## 
## Model df: 1.   Total lags used: 10
forecast2 <- forecast(arima_model2, xreg = train_diff$ROC, h=30)
## Warning in forecast.forecast_ARIMA(arima_model2, xreg = train_diff$ROC, :
## Upper prediction intervals are not finite.
accuracy(forecast2, test = test.set$Close)
##                         ME       RMSE        MAE MPE MAPE      MASE
## Training set -0.0009098155 0.02906868 0.02384551 NaN  Inf 0.9973949
##                   ACF1
## Training set 0.2431704

The revised model and the resulting forecasts can be viewed below.

K-Nearest Neighbours:

set.seed(100)
knn_model <- train(Response ~ High + Low + Open + Close, 
                   data = train_norm, 
                   method = "knn")
                   
predictions <- predict(knn_model, newdata = test.set.knn)
confusionMatrix(predictions, test.set.knn$Response, mode = "everything")
## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           [-0.193,-0.0192] (-0.0192,-0.00434]
##   [-0.193,-0.0192]                  0                  0
##   (-0.0192,-0.00434]               60                 66
##   (-0.00434,0.00659]               35                 32
##   (0.00659,0.0221]                 58                 50
##   (0.0221,0.244]                    0                  0
##                     Reference
## Prediction           (-0.00434,0.00659] (0.00659,0.0221] (0.0221,0.244]
##   [-0.193,-0.0192]                    0                0              0
##   (-0.0192,-0.00434]                 64               69             52
##   (-0.00434,0.00659]                 26               45             32
##   (0.00659,0.0221]                   59               45             60
##   (0.0221,0.244]                      0                0              0
## 
## Overall Statistics
##                                          
##                Accuracy : 0.1819         
##                  95% CI : (0.155, 0.2114)
##     No Information Rate : 0.2112         
##     P-Value [Acc > NIR] : 0.9792         
##                                          
##                   Kappa : -0.0253        
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: [-0.193,-0.0192] Class: (-0.0192,-0.00434]
## Sensitivity                           0.0000                   0.44595
## Specificity                           1.0000                   0.59504
## Pos Pred Value                           NaN                   0.21222
## Neg Pred Value                        0.7968                   0.81448
## Precision                                 NA                   0.21222
## Recall                                0.0000                   0.44595
## F1                                        NA                   0.28758
## Prevalence                            0.2032                   0.19655
## Detection Rate                        0.0000                   0.08765
## Detection Prevalence                  0.0000                   0.41301
## Balanced Accuracy                     0.5000                   0.52049
##                      Class: (-0.00434,0.00659] Class: (0.00659,0.0221]
## Sensitivity                            0.17450                 0.28302
## Specificity                            0.76159                 0.61785
## Pos Pred Value                         0.15294                 0.16544
## Neg Pred Value                         0.78902                 0.76299
## Precision                              0.15294                 0.16544
## Recall                                 0.17450                 0.28302
## F1                                     0.16301                 0.20882
## Prevalence                             0.19788                 0.21116
## Detection Rate                         0.03453                 0.05976
## Detection Prevalence                   0.22576                 0.36122
## Balanced Accuracy                      0.46804                 0.45043
##                      Class: (0.0221,0.244]
## Sensitivity                         0.0000
## Specificity                         1.0000
## Pos Pred Value                         NaN
## Neg Pred Value                      0.8088
## Precision                               NA
## Recall                              0.0000
## F1                                      NA
## Prevalence                          0.1912
## Detection Rate                      0.0000
## Detection Prevalence                0.0000
## Balanced Accuracy                   0.5000
knn_model
## k-Nearest Neighbors 
## 
## 1639 samples
##    4 predictor
##    5 classes: '[-0.193,-0.0192]', '(-0.0192,-0.00434]', '(-0.00434,0.00659]', '(0.00659,0.0221]', '(0.0221,0.244]' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 1639, 1639, 1639, 1639, 1639, 1639, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.3878874  0.2347137
##   7  0.3868514  0.2335506
##   9  0.3876914  0.2345537
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

Feature Selection

Feature selection will be used to find the best model for the classification algorithm. This process will determine which attribute(s) best predict the response of the next day’s market. The ARIMA model uses the Akaike information criterion (AIC) to select the best-fitting model. Two other feature selection methods, a filter-based and a wrapper-based technique, are shown below.

# Filter-based Feature Selection Technique
# Correlation Based
cfs(Response ~ High + Low + Open + Close, data = train_norm)
## [1] "High"
# Wrapper-based Feature Selection Technique
full.model = glm(Response ~ High + Low + Open + Close, data = train_norm, family = binomial())
step(full.model, data = train_norm, direction = "backward")
## Start:  AIC=997.84
## Response ~ High + Low + Open + Close
## 
##         Df Deviance     AIC
## <none>       987.84  997.84
## - Low    1   990.83  998.83
## - High   1  1012.72 1020.72
## - Open   1  1099.84 1107.84
## - Close  1  1143.51 1151.51
## 
## Call:  glm(formula = Response ~ High + Low + Open + Close, family = binomial(), 
##     data = train_norm)
## 
## Coefficients:
## (Intercept)         High          Low         Open        Close  
##       1.476     -110.978       36.150     -194.465      275.463  
## 
## Degrees of Freedom: 1638 Total (i.e. Null);  1634 Residual
## Null Deviance:       1635 
## Residual Deviance: 987.8     AIC: 997.8

Performance Evaluation

From the feature selection stage, the best fitting model will now be built and compared to the original model.

set.seed(1000)
knn_model2 <- train(Response ~ High, 
                   data = train_norm, 
                   method = "knn")

knn_model2                   
## k-Nearest Neighbors 
## 
## 1639 samples
##    1 predictor
##    5 classes: '[-0.193,-0.0192]', '(-0.0192,-0.00434]', '(-0.00434,0.00659]', '(0.00659,0.0221]', '(0.0221,0.244]' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 1639, 1639, 1639, 1639, 1639, 1639, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa        
##   5  0.2050098   0.0068492386
##   7  0.2022920   0.0035853912
##   9  0.1991225  -0.0002994579
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
predictions2 <- predict(knn_model2, test.set.knn)
confusionMatrix(predictions2, test.set.knn$Response, mode = "everything")
## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           [-0.193,-0.0192] (-0.0192,-0.00434]
##   [-0.193,-0.0192]                  0                  0
##   (-0.0192,-0.00434]               55                 48
##   (-0.00434,0.00659]                0                  0
##   (0.00659,0.0221]                 53                 47
##   (0.0221,0.244]                   45                 53
##                     Reference
## Prediction           (-0.00434,0.00659] (0.00659,0.0221] (0.0221,0.244]
##   [-0.193,-0.0192]                    0                0              0
##   (-0.0192,-0.00434]                 44               49             53
##   (-0.00434,0.00659]                  0                0              0
##   (0.00659,0.0221]                   49               53             46
##   (0.0221,0.244]                     56               57             45
## 
## Overall Statistics
##                                          
##                Accuracy : 0.1939         
##                  95% CI : (0.1662, 0.224)
##     No Information Rate : 0.2112         
##     P-Value [Acc > NIR] : 0.8868         
##                                          
##                   Kappa : -0.0071        
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: [-0.193,-0.0192] Class: (-0.0192,-0.00434]
## Sensitivity                           0.0000                   0.32432
## Specificity                           1.0000                   0.66777
## Pos Pred Value                           NaN                   0.19277
## Neg Pred Value                        0.7968                   0.80159
## Precision                                 NA                   0.19277
## Recall                                0.0000                   0.32432
## F1                                        NA                   0.24181
## Prevalence                            0.2032                   0.19655
## Detection Rate                        0.0000                   0.06375
## Detection Prevalence                  0.0000                   0.33068
## Balanced Accuracy                     0.5000                   0.49605
##                      Class: (-0.00434,0.00659] Class: (0.00659,0.0221]
## Sensitivity                             0.0000                 0.33333
## Specificity                             1.0000                 0.67172
## Pos Pred Value                             NaN                 0.21371
## Neg Pred Value                          0.8021                 0.79010
## Precision                                   NA                 0.21371
## Recall                                  0.0000                 0.33333
## F1                                          NA                 0.26044
## Prevalence                              0.1979                 0.21116
## Detection Rate                          0.0000                 0.07039
## Detection Prevalence                    0.0000                 0.32935
## Balanced Accuracy                       0.5000                 0.50253
##                      Class: (0.0221,0.244]
## Sensitivity                        0.31250
## Specificity                        0.65353
## Pos Pred Value                     0.17578
## Neg Pred Value                     0.80080
## Precision                          0.17578
## Recall                             0.31250
## F1                                 0.22500
## Prevalence                         0.19124
## Detection Rate                     0.05976
## Detection Prevalence               0.33997
## Balanced Accuracy                  0.48302
set.seed(1001)
knn_model3 <- train(Response ~ Low, 
                   data = train_norm, 
                   method = "knn")

knn_model3                   
## k-Nearest Neighbors 
## 
## 1639 samples
##    1 predictor
##    5 classes: '[-0.193,-0.0192]', '(-0.0192,-0.00434]', '(-0.00434,0.00659]', '(0.00659,0.0221]', '(0.0221,0.244]' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 1639, 1639, 1639, 1639, 1639, 1639, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa       
##   5  0.2038097   0.005434335
##   7  0.2007406   0.001828813
##   9  0.1971804  -0.002522338
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
predictions3 <- predict(knn_model3, test.set.knn)
confusionMatrix(predictions3, test.set.knn$Response, mode = "everything")
## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           [-0.193,-0.0192] (-0.0192,-0.00434]
##   [-0.193,-0.0192]                  0                  0
##   (-0.0192,-0.00434]                7                 10
##   (-0.00434,0.00659]              131                124
##   (0.00659,0.0221]                 15                 14
##   (0.0221,0.244]                    0                  0
##                     Reference
## Prediction           (-0.00434,0.00659] (0.00659,0.0221] (0.0221,0.244]
##   [-0.193,-0.0192]                    0                0              0
##   (-0.0192,-0.00434]                 13               13             16
##   (-0.00434,0.00659]                118              132            119
##   (0.00659,0.0221]                   18               14              9
##   (0.0221,0.244]                      0                0              0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.1886          
##                  95% CI : (0.1612, 0.2184)
##     No Information Rate : 0.2112          
##     P-Value [Acc > NIR] : 0.9425          
##                                           
##                   Kappa : -0.013          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: [-0.193,-0.0192] Class: (-0.0192,-0.00434]
## Sensitivity                           0.0000                   0.06757
## Specificity                           1.0000                   0.91901
## Pos Pred Value                           NaN                   0.16949
## Neg Pred Value                        0.7968                   0.80115
## Precision                                 NA                   0.16949
## Recall                                0.0000                   0.06757
## F1                                        NA                   0.09662
## Prevalence                            0.2032                   0.19655
## Detection Rate                        0.0000                   0.01328
## Detection Prevalence                  0.0000                   0.07835
## Balanced Accuracy                     0.5000                   0.49329
##                      Class: (-0.00434,0.00659] Class: (0.00659,0.0221]
## Sensitivity                             0.7919                 0.08805
## Specificity                             0.1623                 0.90572
## Pos Pred Value                          0.1891                 0.20000
## Neg Pred Value                          0.7597                 0.78770
## Precision                               0.1891                 0.20000
## Recall                                  0.7919                 0.08805
## F1                                      0.3053                 0.12227
## Prevalence                              0.1979                 0.21116
## Detection Rate                          0.1567                 0.01859
## Detection Prevalence                    0.8287                 0.09296
## Balanced Accuracy                       0.4771                 0.49689
##                      Class: (0.0221,0.244]
## Sensitivity                         0.0000
## Specificity                         1.0000
## Pos Pred Value                         NaN
## Neg Pred Value                      0.8088
## Precision                               NA
## Recall                              0.0000
## F1                                      NA
## Prevalence                          0.1912
## Detection Rate                      0.0000
## Detection Prevalence                0.0000
## Balanced Accuracy                   0.5000

Future work can also evaluate the use of training control measures for KNN. For time-series data, standard resampling techniques such as k-fold cross-validation are not appropriate, since they break the temporal ordering of the observations. Instead, a sliding window can be created.

set.seed(100)

# Rolling forecasting origin: train on 40 consecutive observations,
# evaluate on the next 20, then slide the fixed-length window forward
control <- trainControl(method = "timeslice",
                        initialWindow = 40,
                        horizon = 20,
                        fixedWindow = TRUE)

knn_model4 <- train(Response ~ High + Low + Open + Close, 
                    data = train_norm, 
                    method = "knn",
                    trControl = control)

# Evaluate the sliding-window model on the held-out test set
predictions2 <- predict(knn_model4, newdata = test.set.knn)
confusionMatrix(predictions2, test.set.knn$Response, mode = "everything")
## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           [-0.193,-0.0192] (-0.0192,-0.00434]
##   [-0.193,-0.0192]                  0                  0
##   (-0.0192,-0.00434]               57                 45
##   (-0.00434,0.00659]               39                 42
##   (0.00659,0.0221]                 57                 61
##   (0.0221,0.244]                    0                  0
##                     Reference
## Prediction           (-0.00434,0.00659] (0.00659,0.0221] (0.0221,0.244]
##   [-0.193,-0.0192]                    0                0              0
##   (-0.0192,-0.00434]                 56               68             60
##   (-0.00434,0.00659]                 36               46             20
##   (0.00659,0.0221]                   57               45             64
##   (0.0221,0.244]                      0                0              0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.1673          
##                  95% CI : (0.1414, 0.1959)
##     No Information Rate : 0.2112          
##     P-Value [Acc > NIR] : 0.9989          
##                                           
##                   Kappa : -0.0439         
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: [-0.193,-0.0192] Class: (-0.0192,-0.00434]
## Sensitivity                           0.0000                   0.30405
## Specificity                           1.0000                   0.60165
## Pos Pred Value                           NaN                   0.15734
## Neg Pred Value                        0.7968                   0.77944
## Precision                                 NA                   0.15734
## Recall                                0.0000                   0.30405
## F1                                        NA                   0.20737
## Prevalence                            0.2032                   0.19655
## Detection Rate                        0.0000                   0.05976
## Detection Prevalence                  0.0000                   0.37981
## Balanced Accuracy                     0.5000                   0.45285
##                      Class: (-0.00434,0.00659] Class: (0.00659,0.0221]
## Sensitivity                            0.24161                 0.28302
## Specificity                            0.75662                 0.59764
## Pos Pred Value                         0.19672                 0.15845
## Neg Pred Value                         0.80175                 0.75693
## Precision                              0.19672                 0.15845
## Recall                                 0.24161                 0.28302
## F1                                     0.21687                 0.20316
## Prevalence                             0.19788                 0.21116
## Detection Rate                         0.04781                 0.05976
## Detection Prevalence                   0.24303                 0.37716
## Balanced Accuracy                      0.49912                 0.44033
##                      Class: (0.0221,0.244]
## Sensitivity                         0.0000
## Specificity                         1.0000
## Pos Pred Value                         NaN
## Neg Pred Value                      0.8088
## Precision                               NA
## Recall                              0.0000
## F1                                      NA
## Prevalence                          0.1912
## Detection Rate                      0.0000
## Detection Prevalence                0.0000
## Balanced Accuracy                   0.5000
knn_model4
## k-Nearest Neighbors 
## 
## 1639 samples
##    4 predictor
##    5 classes: '[-0.193,-0.0192]', '(-0.0192,-0.00434]', '(-0.00434,0.00659]', '(0.00659,0.0221]', '(0.0221,0.244]' 
## 
## No pre-processing
## Resampling: Rolling Forecasting Origin Resampling (20 held-out with a fixed window) 
## Summary of sample sizes: 40, 40, 40, 40, 40, 40, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa     
##   5  0.2574051  0.05650128
##   7  0.2431646  0.03705553
##   9  0.2345253  0.02160727
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
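To make the rolling forecasting origin scheme concrete, the sketch below (in Python for illustration; the function `time_slices` is our own, not part of caret) generates the train/test index pairs that caret's `timeslice` method produces with `initialWindow = 40`, `horizon = 20`, and `fixedWindow = TRUE`, assuming the default step of one observation between slices:

```python
def time_slices(n, initial_window, horizon, fixed_window=True):
    """Generate (train, test) index pairs for rolling-origin resampling.

    n              -- total number of ordered observations
    initial_window -- size of each training window
    horizon        -- number of held-out observations after the window
    fixed_window   -- if True the window slides; if False it grows
    """
    slices = []
    start = 0
    # Each slice needs initial_window training points plus horizon test points
    while start + initial_window + horizon <= n:
        train_start = start if fixed_window else 0
        train = list(range(train_start, start + initial_window))
        test = list(range(start + initial_window,
                          start + initial_window + horizon))
        slices.append((train, test))
        start += 1  # slide the origin forward by one observation
    return slices

# First slice trains on observations 0..39 and tests on 40..59;
# the window then slides forward one observation at a time.
first_train, first_test = time_slices(100, 40, 20)[0]
```

This preserves the temporal ordering that ordinary k-fold cross-validation would destroy: every test observation occurs strictly after the observations the model was trained on.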